Abstract. This paper takes on the problem of automatically identifying
clinically relevant patterns in medical datasets without compromising
patient privacy. To achieve this goal, we treat datasets as a black box for
both internal and external users of data, which lets us handle clinical data
queries directly and far more efficiently. The novelty of the approach lies
in avoiding the data de-identification process often used as a means of
preserving patient privacy. The implemented toolkit combines software
engineering technologies such as Java EE and RESTful web services to
allow exchanging medical data in an unidentifiable XML format and to
restrict users according to the need-to-know principle. Our technique
also inhibits retrospective processing of data, such as attacks by an
adversary on a medical dataset using advanced computational methods to
reveal Protected Health Information (PHI). The approach is validated on
an endoscopic reporting application based on the openEHR and MST
standards. From the usability perspective, the approach can be used by
clinical researchers and by governmental or non-governmental
organizations to query datasets when monitoring health care services to
improve quality of care.


1 Introduction 

Patients’ Electronic Health Records (EHRs) are stored, processed, and
transmitted across several healthcare platforms and among clinical
researchers for on-line diagnostic services and other clinical research.
This data dissemination serves as a basis for the prevention and diagnosis
of disease and for secondary purposes such as health system planning,
public health surveillance, and generation of anonymized data for testing.
However, exchanging data across organizations is a non-trivial task because
of the inherent potential for privacy intrusion. Medical organizations tend
to have confidentiality agreements with patients, which strictly forbid
them to disclose any identifiable information about the patients. The
Health Insurance Portability and Accountability Act (HIPAA) explicitly sets
out the confidentiality protections for health information with which any
system for sharing EHRs must legally comply. To abide by these strict
regulations, data custodians generally use de-identification techniques so
that any identifiable information in a patient’s EHR can be suppressed or
generalized.

In reality, however, research indicates that 87% of the U.S. population
can be distinguished by sex, date of birth, and zip code. We define
quasi-identifiers as the background information about one or more people in
the dataset. An adversary who knows these quasi-identifiers may be able to
recognize an individual and take advantage of his or her clinical data. On
the other hand, most of these quasi-identifiers carry statistical meaning
in clinical research. There is thus a tension between reducing disclosure
risk and retaining data quality. For instance, if information related to
patients’ residence were excluded from the EHR, clinical partners could no
longer track the spread of a disease. Strictly filtered data may therefore
lead to failure in operations. Conversely, releasing patients’ entire
information, including residence, sex, and date of birth, brings a higher
disclosure risk.
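
To make the notion of a quasi-identifier concrete, the following minimal sketch counts how many records in a small table are unique on the combination of sex, date of birth, and zip code. The class name, the method name, and the sample rows are illustrative assumptions and are not part of the toolkit described in this paper.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class QuasiIdentifierCheck {

    /** Fraction of rows whose (sex, date of birth, zip code) triple occurs exactly once. */
    static double uniqueFraction(List<String[]> rows) {
        if (rows.isEmpty()) {
            return 0.0;
        }
        // Count how often each quasi-identifier combination occurs.
        Map<String, Integer> counts = new HashMap<>();
        for (String[] row : rows) {
            counts.merge(String.join("|", row), 1, Integer::sum);
        }
        // A row is at risk if nobody else shares its combination.
        int unique = 0;
        for (String[] row : rows) {
            if (counts.get(String.join("|", row)) == 1) {
                unique++;
            }
        }
        return (double) unique / rows.size();
    }

    public static void main(String[] args) {
        // Made-up rows: sex, date of birth, zip code.
        List<String[]> rows = Arrays.asList(
                new String[] {"F", "1962-08-28", "71344"},
                new String[] {"F", "1962-08-28", "71344"},
                new String[] {"M", "2009-05-25", "68114"},
                new String[] {"F", "1994-07-23", "53534"});
        // Two of the four rows are unique on the three attributes.
        System.out.println(uniqueFraction(rows));
    }
}
```

The higher this fraction, the more records an adversary with background knowledge could single out, even when names and other direct identifiers have been removed.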

In this paper we address an emerging problem of de-identification
techniques, namely, that offering a de-identified dataset for a secondary
purpose makes it possible for a prospective user to perform retrospective
processing of medical data, which endangers patient privacy. Figure 1
overviews the standard data request process and the proposed technique.
Our approach differs from the traditional techniques in that it employs
software engineering principles to isolate and develop key requirements of
data custodians and requesters. We apply Service-Oriented Architecture
(SOA), which provides an effective solution for connecting business
functions across the web, both between
and within enterprises |S]. 

We also present a prototype of our evolving toolset, implemented using web
services to handle data queries. The results are retrieved in an XML data
format that excludes all personal information of patients. The basic model
follows the principles of RESTful web services by combining three elements:
a URL repository that uniquely identifies the resources corresponding to
clinical data queries, service consumers requesting data, and service
producers acting as custodians of clinical data. Although the idea of
combining web services with SQL queries is not new, it provides a
technological approach to avoiding medical data re-identification risks.
The implemented toolkit uses Java EE, which offers an easy way to develop
applications using EJBs. Java EE is also widespread and widely used by the
community.

Our proof-of-concept implementation uses GastrOS, an openEHR [7] database
describing an endoscopic application. The underlying technique provides the
ability to construct or use stored queries on a clinical dataset. Employing
the clinical toy data warehouse of the GastrOS prototype is a useful way to
demonstrate queries on medical data for secondary use. The proposed
technique avoids compromising patients’ personal information without
relying on de-identification framework tools. For instance, the following
query can be posed to the GastrOS database using our toolkit:

- Find the number of patients who are still susceptible to developing a
  Hepatitis B infection even after full compliance with the Hepatitis B
  vaccination schedule, i.e., the baseline and second detection dates for
  the HBsAg and Anti-HBs tests both show negative results.

Fig. 1. (a) shows the traditional lifecycle of medical datasets. Custodians
can be hospitals, agents may be entities working on their behalf, and
recipients are individuals or organizations such as a pharmaceutical
company. (b) depicts the proposed approach, which links external entities
to data centers using a web interface. The approach excludes all direct
data access to a dataset.

^ De-identification is defined as a technology that deletes or removes
identifiable information, such as name and SSN, from the released
information, and suppresses or generalizes quasi-identifiers, such as zip
code and date of birth, to ensure that medical data is not re-identifiable
(re-identification being the reverse process of de-identification).

® http://gastros.codeplex.com


The set of clinical data queries described in the paper has been crafted
with the help of clinical researchers at Vanderbilt University. Supporting
such complex queries required developing a set of tools, of which this
paper presents a first attempt. In contrast to recent developments on big
data, this paper does not focus on the management challenges of medical
dataset repositories, but rather on software engineering solutions to the
challenges of querying medical data without endangering patient privacy.
Our approach mainly contributes to the development of privacy-preserving
techniques on patient data by treating datasets as a black box. In this
way, disclosure risks associated with patient data are minimized. One of
the key constraints in accomplishing this goal is keeping the computation
with the data custodians. Relocating datasets is not only unsafe but also
invites data re-identification attempts. To ensure that only legitimate
users access and execute clinical data queries, we implement an
authentication and authorization mechanism using role-based access control
(RBAC). RBAC offers a flexible architecture that manages users from
different organizations by assigning roles and their corresponding
permissions.

The paper proceeds as follows: Section 2 describes the related work;
Section 3 states an application example; Section 4 presents the technical
details of our approach; Section 5 overviews the clinical data queries
corresponding to the GastrOS dataset; Section 6 discusses the
authentication and authorization mechanism connecting users to clinical
datasets; Section 7 summarizes the work and details some future research
directions.

2 Related Work 

In contrast to some of the existing techniques [T^ 0 CQ] [13] dl, our approach 
relies on advanced software engineering principles and technologies for
analyzing clinical datasets. For example, caGrid 1.0 [T^ (now caGrid 2.0),
released in 2006, is an approach that provides a complex technical
infrastructure for biomedical research through an interconnected network.
It aims to provide support for discovery, characterization, integrated
access, and management of diverse and disparate collections of information
sources, analysis methods, and applications in biomedical research. caGrid
1.0 was initially designed only for cancer research. caGrid combines Grid
computing technologies and the Web Services Resource Framework (WSRF)
standards to provide a set of core services, toolkits for the development
and deployment of new community-provided services, and APIs for building
client applications. However, caGrid does not focus on an explicit query
mechanism for inferring details from medical datasets, such as the one
proposed here. Similar work discusses a combined interpretation of
biological data from various sources. That work, however, considers the
problem of continuous updates of both the structure and content of a
database and proposes the novel database SYSTOMONAS (SYSTems biology of
pseudOMONAS). Interestingly, this technique combines a data warehouse
concept with web services. The data warehouse is supported by traditional
ETL (extract, transform, and load) processes and is available at
http://www.systomonas.de.

De-identification techniques for medical data have been studied and
developed by statisticians dealing with integrity and confidentiality
issues of statistical data. The major tools used for data
de-identification are (i) CAT (Cornell Anonymization Kit) |5T], (ii)
μ-Argus [TT|, and (iii) sdcMicro. CAT anonymizes data using
generalization, which has been proposed as a method that replaces the
values of quasi-identifiers with value ranges. μ-Argus is an acronym for
Anti-Re-identification General Utility System and is based on a view of
safe and unsafe microdata that is used at Statistics Netherlands, which
means that the rules it applies to protect data come from practice rather
than from a precise formal definition. Developed by Statistics Austria,
sdcMicro is an extensive system for statistical computing. Like μ-Argus,
this tool implements several anonymization methods that take different
types of variables into account. We have reported a comparison of the
efficacy of these numerical methods, which are used to anonymize
quasi-identifiers in order to avoid disclosing individuals’ sensitive
information. The Privacy Analytics Risk Assessment Tool (PARAT) is the
only commercial product available so far for de-identifying medical data.
Our quantitative analysis [S] of de-identification tools shows that
de-identifying data provides no guarantee of anonymity HE]. A study [T]
also shows that organizations using data de-identification are vulnerable
to re-identification at different rates.

Another approach describes a special query tool developed for the
Indianapolis/Regenstrief Shared Pathology Informatics Network (SPIN) and
integrated into the Indiana Network for Patient Care (INPC). This tool
allows retrieving de-identified data sets using complex logic and
auto-coded final diagnoses, and it supports multiple types of statistical
analyses. However, many of the technical details, for example the use of
complex logic, have not been published. This and other similar efforts are
mostly database-centric. Work somewhat similar to this paper has been
developed at Massachusetts General Hospital (QPID Inc.), offering
solutions at a commercial level, but no prototype is available to
experiment with. A Web-based approach for enriching the capabilities of
the data-querying system has also been developed; it considers three
important aspects: the interface design used for query formulation, the
representation of query results, and the models employed for formulating
query criteria. The notion of differential privacy [1] aims to provide
means to maximize the accuracy of queries on statistical databases while
minimizing the chances of identifying their records.

Our analysis shows that the effort to secure medical datasets is mainly
two-faceted: 1) most research endeavors have explored the design and
development of de-identification tools, and 2) some work, mostly led by
medical doctors, has tried to address the construction of clinical queries
but does not provide technical details on the construction of the
toolsets. Our approach, which treats medical datasets as a black box,
mainly considers automating the services expected from a data custodian in
order to minimize data disclosure risks and to make clinical datasets
easily accessible to internal and external users.

3 GastrOS: An Example Application 

GastrOS, an openEHR database describing an endoscopic application, is used
as a case study of electronic medical data. This application formed part of
the research done at the University of Auckland by Koray Atalag in 2010,
which investigated software maintainability and interoperability. To this
end, the domain knowledge model of openEHR Archetypes and Templates drove
the generation of its graphical user interface. Moreover, the data content
depicting the employed terminology, record structure, and semantics was
based on the Minimal Standard Terminology for Digestive Endoscopy (MST)
specified by the World Organization of Digestive Endoscopy (OMED) as its
official standard.

® http://www.privacyanalytics.ca/software/

^ http://www.qpidhealth.com
® http://gastros.codeplex.com

Employing the clinical toy data warehouse of the GastrOS prototype is a
useful way to demonstrate how the proposed approach supports clinical
research queries on medical data for secondary use without compromising
patients’ personal information. The queries shown here focus on endoscopic
findings that provide valuable anonymized information to clinicians. The
implemented queries are mainly intended for medical practitioners, to help
them in the clinical management of patients at the point of care, and for
health decision-makers, to help them formulate appropriate health
policies. For example, the following queries were obtained through
brainstorming with medical doctors to illustrate our approach.

- Total number of dialysis endoscopic examinations from January 1, 2010 to
  December 31, 2010.

- Top 5 diagnoses for those patients who received endoscopic examination and
  the number of cases for each diagnosis from January 1, 2010 to December 31,
  2010.

- Age profile of endoscopic patients from January 1, 2010 to December 31,
  2010, i.e., the number of dialysis patients belonging to each of the age
  brackets [below 18; 18 to below 40; 40 to below 60; 60 and above].

- Number of patients who are still susceptible to developing a Hepatitis B
  infection even after full compliance with the Hepatitis B vaccination
  schedule, i.e., the baseline and second detection dates for the HBsAg and
  Anti-HBs tests both show negative results.
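
The age-profile query can be illustrated with a short sketch that assigns each patient to one of the four half-open brackets and returns only the counts. This is a minimal illustration, assuming ages are measured in completed years on a chosen reference date; the class and method names are hypothetical and are not taken from the toolkit.

```java
import java.time.LocalDate;
import java.time.Period;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class AgeBrackets {

    private static final String[] LABELS = {
        "below 18", "18 to below 40", "40 to below 60", "60 and above"
    };

    /** Maps an age in completed years to one of the four brackets named in the query. */
    static String bracketOf(LocalDate dateOfBirth, LocalDate asOf) {
        int age = Period.between(dateOfBirth, asOf).getYears();
        if (age < 18) return LABELS[0];
        if (age < 40) return LABELS[1];
        if (age < 60) return LABELS[2];
        return LABELS[3];
    }

    /** Counts patients per bracket; only the aggregated counts are returned. */
    static Map<String, Integer> profile(List<LocalDate> datesOfBirth, LocalDate asOf) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String label : LABELS) {
            counts.put(label, 0);
        }
        for (LocalDate dob : datesOfBirth) {
            counts.merge(bracketOf(dob, asOf), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Illustrative dates of birth; the reference date is an assumption.
        List<LocalDate> dobs = Arrays.asList(
                LocalDate.of(1962, 8, 28), LocalDate.of(2009, 5, 25),
                LocalDate.of(1994, 7, 23), LocalDate.of(1973, 12, 27));
        System.out.println(profile(dobs, LocalDate.of(2010, 12, 31)));
    }
}
```

Because the brackets are half-open, a patient who is exactly 40 on the reference date falls into "40 to below 60" and is counted once.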

The queries given above are only a subset of the original queries. The
database structure of the GastrOS application is described below.


3.1 GastrOS data structure 

Figure 2 describes the data structure of the GastrOS database. The database
contains the following tables: clinicaldetection (doctor detection
records), patient (patient information), and examination (examination
records).

The patient table has two relations: one patient may have more than one
clinical detection record or examination record by doctor(s), so the
patient id is added as a foreign key in the tables ClinicalDetection and
Examination. GastrOS is a toy database example with an insufficient amount
of data: the original database contains fewer than 20 rows in each table,
which is too few to be useful for our SQL queries. Therefore, we
automatically generated about 10,000 entries of virtual data (note that
real patient data cannot be published anyway). An example of the generated
data is given in Figure 3. Table 1 provides the up-to-date number of rows
in each table of the GastrOS database.

clinicaldetection              patient                        examination
-----------------------------  -----------------------------  -----------------------------
Detection_ID   MEDIUMINT(9)    PID            MEDIUMINT(9)    Report_ID       MEDIUMINT(9)
Patient_ID     MEDIUMINT(9)    Name           VARCHAR(255)    Patient_ID      MEDIUMINT(9)
DetectedDate   DATE            Surname        VARCHAR(255)    Endoscopy_Date  DATETIME
Times          TEXT            Gender         CHAR(1)         Diagnoses_Text  VARCHAR(255)
AntiHBs        TEXT            DoB            DATE            Doctor          VARCHAR(255)
HBsAg          TEXT            Country        VARCHAR(100)
HIV            TEXT            StreetAddress  VARCHAR(255)
HDV            TEXT            City           VARCHAR(255)
HCV            TEXT            Postal         VARCHAR(10)

(clinicaldetection.Patient_ID and examination.Patient_ID reference
patient.PID.)

Fig. 2. E-R diagram


Table 1. Generated data in tables

Table              Rows    Size
ClinicalDetection  6,393   432 KB
Examination        2,020   272 KB
Patient            1,881   224 KB
Sum                10,294  928 KB


PID   | Name     | Surname  | Gender | DoB        | Country                       | StreetAddress                  | City             | Postal
10000 | Adena    | Reeves   | F      | 1962-08-28 | Montenegro                    | P.O. Box 936, 9290 Aptent Ave  | Morkhoven        | 71344
10001 | Buffy    | Warner   | M      | 2009-05-25 | Guinea-Bissau                 | P.O. Box 624, 5536 Nunc St.    | Graz             | 68114-186
10002 | Kaye     | Green    | F      | 1994-07-23 | Norway                        | P.O. Box 650, 1264 Tellus. St. | Bojano           | T04 3WO
10003 | Keiko    | Gonzalez | M      | 1973-12-27 | Iraq                          | 1889 Magna. Street             | Chelsea          | 8064
10004 | Kylynn   | Carver   | F      | 1974-01-22 | Tanzania                      | Ap #357-247 PerRd.             | Oberhausen       | 53534
10005 | Daquan   | Sosa     | F      | 1961-12-28 | Holy See (Vatican City State) | Ap #727-5534 Mauris, Avenue    | Eberswalde-Finow | 5690ER
10006 | Rebekah  | Navarro  | F      | 1974-02-01 | Saudi Arabia                  | P.O. Box 698. 3686 Dul. Avenue | Wolvertem        | 2976
10007 | Zane     | Benson   | M      | 2002-10-19 | Mauritania                    | Ap #852-3480 Omare Ave         | Dufftown         | 1137
10008 | Jennifer | Pettv    | F      | 1985-08-31 | Isle of Man                   | 598-2436 Sit Rd,               | Bathurst         | 71612

Fig. 3. Generated data for the patient table


4 The Proposed Technique 


The proposed approach implements a three-tier application and, as opposed
to traditional techniques, never releases medical datasets. The main
characteristic of the technique is that it builds on relatively new
software technologies to support clinical data queries. To support the
clinical queries under consideration, we develop an integrated application
using SOA and Java EE (Enterprise Edition) to extract data from the GastrOS
database. There are plenty of commercial containers, such as JBoss (Red
Hat), WebSphere (IBM), and WebLogic and GlassFish (Oracle), which could be
used for our purpose. Our prototype tool, however, combines Java EE
technologies: JSF with PrimeFaces, EJB, and the Java Persistence API
(JPA). JPA is a Java specification for accessing, persisting, and managing
data between Java objects/classes and a relational database. The REST
architecture underlying RESTful web services treats everything as a
resource identified by a URL. Resources are handled using POST, GET, PUT,
and DELETE operations, which correspond to the Create, Read, Update, and
Delete (CRUD) operations. Note that in our toolkit it suffices to implement
Read operations to handle the described queries. Every request from a
client is handled independently, and it must contain all the information
required to interpret the request.


[Screenshot of the QueryList page of the EHR System (Version 2.0) as seen by
organization A. Columns: Id, Description, SQL Query, Service URL; each
query can be run with or without Ajax.]

Id 2
  Description: Top 5 diagnoses for those patients who received endoscopic
               examination and the number of cases for each diagnosis from
               January 1, 2010 to December 31, 2010.
  SQL Query:   SELECT Diagnoses_Text, COUNT(Diagnoses_Text) AS Num
               FROM Examination GROUP BY Diagnoses_Text
               HAVING COUNT(Diagnoses_Text) > 0 LIMIT 5
  Service URL: rws/querytwo

Id 3
  Description: Age profile of endoscopic patients from January 1, 2010 to
               December 31, 2010, i.e., number of dialysis patients
               belonging to each of the age brackets [below 18; 18 to below
               40; 40 to below 60; 60 and above]
  SQL Query:   SELECT * FROM
               (SELECT DISTINCT COUNT(PID) AS NumBelow18
                FROM Patient, Examination
                WHERE Patient.PID = Examination.Patient_ID
                AND YEAR(CURRENT_DATE()) - YEAR(DoB) < 18) AS NumBelow18,
               (SELECT DISTINCT COUNT(PID) AS Num18to40
                FROM Patient, Examination
                WHERE Patient.PID = Examination.Patient_ID
                AND YEAR(CURRENT_DATE()) - YEAR(DoB) BETWEEN 18 AND 40) AS Num18to40,
               (SELECT DISTINCT COUNT(PID) AS Num40to60
                FROM Patient, Examination
                WHERE Patient.PID = Examination.Patient_ID
                AND YEAR(CURRENT_DATE()) - YEAR(DoB) BETWEEN 40 AND 60) AS Num40to60,
               (SELECT DISTINCT COUNT(PID) AS NumAbove60
                FROM Patient, Examination
                WHERE Patient.PID = Examination.Patient_ID
                AND YEAR(CURRENT_DATE()) - YEAR(DoB) > 60) AS NumAbove60
  Service URL: rws/querythree

Fig. 4. The list of authorized roles for Organization A


// Code for the RESTful web service.
import java.util.List;
import javax.ejb.EJB;
import javax.persistence.Query;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.Context;
import javax.ws.rs.core.UriInfo;

@Path("queryone")
public class QueryOne {

    @Context
    private UriInfo context;

    @EJB
    QueryBean bean;

    // Read-only endpoint: returns the aggregated result as XML.
    @GET
    @Produces("application/xml")
    public String getHtml() {
        String sql = "SELECT Country, COUNT(Report_ID) AS TotalNum "
                + "FROM examination, patient "
                + "WHERE examination.Patient_ID = patient.PID "
                + "AND Endoscopy_Date BETWEEN '2010-1-1' AND '2010-12-31' "
                + "GROUP BY Country "
                + "ORDER BY TotalNum DESC";
        return bean.query(sql);
    }
}

// The method query of the session bean (QueryBean); emf is its
// EntityManagerFactory.
public String query(String sql) {
    Query query = emf.createEntityManager().createNativeQuery(sql);
    @SuppressWarnings("unchecked")
    List<Object[]> list = query.getResultList();
    // The original listing is cut off here. toXml is a hypothetical helper
    // (an assumption, not shown in the paper) that wraps each row in <item>
    // and each column value in <element>.
    return toXml(list);
}

5 Implementing Clinical Queries using SOA 

Web-based authentication and authorization are enforced using role-based
access control before any query becomes accessible to external entities.
For instance, the first two queries are shown in Figure 4. They are linked
to Organization A, which has limited access that varies according to the
permissions enabled by a security administrator. The execution of the
queries is thereby managed by the access control features of the tool.
Some of the queries and their corresponding data are given below. SQL
queries, exception results, and running time are presented in columns 1,
2, and 3, respectively, of the figure.

Note that the XML-based format is free of platform and programming
language dependencies. Using this Web-based approach, a diverse set of
queries on clinical data repositories can be supported. For the RESTful
web services, each query must have a URL stored in the database, in the
table uriforwebservice, before it can be executed.

Note that all the data held in a program are objects, whereas our database
is represented in the form of relational tables. Bridging the two requires
an ORM (Object-Relational Mapping) technique. In our prototype
implementation we have used JPA (Java Persistence API), because it comes
with the Java EE framework and queries can be run either as native SQL or
in an object form that allows data manipulation. For instance, we show a
service code snippet above. @Path gives the URL address of this web
service, and @GET is the HTTP method of the RESTful web service; other
methods can be selected with @PUT, @DELETE, and @POST. Upon invoking a web
service using its URL in a browser or from a session bean, the SQL is
executed and the result is returned by the query method, which invokes the
entity manager of JPA. Below, we list some sample clinical queries as well
as their output in an XML format.

# Number of patients for each gender who are still susceptible
# to developing a Hepatitis B infection even after full compliance
# with the Hepatitis B vaccination schedule, i.e., the baseline and
# second detection dates for the HBsAg and Anti-HBs tests both
# show negative results.

<dataset>
  <item>
    <element>F</element>
    <element>184</element>
  </item>
  <item>
    <element>M</element>
    <element>192</element>
  </item>
</dataset>

# Top 5 diagnoses for those patients who received dialysis
# treatment and the number of cases for each diagnosis from
# January 1, 2010 to December 31, 2010.

<?xml version="1.0" encoding="utf-8"?>
<dataset>
  <item>
    <element>
      Diagnoses_Text Colon: Primary malignant tumor,
      Quiescent Crohn’s disease
    </element>
    <element>421</element>
  </item>
  <item>
    <element>
      Diagnoses_Text Esophagus: Normal, Ectopic gastric mucosa
    </element>
    <element>394</element>
  </item>
  <item>
    <element>Esophagus: Reflux esophagitis</element>
    <element>414</element>
  </item>
  <item>
    <element>Esophagus: Varices certain</element>
    <element>406</element>
  </item>
  <item>
    <element>Esophagus: Barrett’s esophagus</element>
    <element>365</element>
  </item>
</dataset>
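
The dataset/item/element layout of the listings above can be produced from a query result with a few lines of Java. The following is a minimal sketch, assuming each result row arrives as an Object[] (as JPA native queries return them); the class name and the escaping rule are illustrative assumptions, not the toolkit's actual serializer.

```java
import java.util.Arrays;
import java.util.List;

final class XmlDataset {

    /** Escapes markup characters so that values cannot break the XML structure. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Wraps each row in <item> and each column value in <element>. */
    static String toXml(List<Object[]> rows) {
        StringBuilder xml = new StringBuilder("<dataset>");
        for (Object[] row : rows) {
            xml.append("<item>");
            for (Object value : row) {
                xml.append("<element>")
                   .append(escape(String.valueOf(value)))
                   .append("</element>");
            }
            xml.append("</item>");
        }
        return xml.append("</dataset>").toString();
    }

    public static void main(String[] args) {
        // Aggregated rows only: a category and its count.
        List<Object[]> rows = Arrays.asList(
                new Object[] {"F", 184},
                new Object[] {"M", 192});
        System.out.println(toXml(rows));
    }
}
```

Escaping matters here because diagnosis texts are free text; an unescaped ampersand or angle bracket would make the response unparsable for the client.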


5.1 Enabling dynamic clinical queries 

The construction and execution of clinical queries on a given dataset are
implemented through a web interface of the tool. The interface allows a
user to dynamically construct a clinical query on a dataset, which adds
greater flexibility to the query mechanism when developing user-oriented
analyses of a dataset. For instance, Fig. 5 demonstrates how to execute a
query such as "Total number of dialysis endoscopic examinations per
country between a given start date and end date", followed by the output
in Fig. 6.
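
A dynamic query of this kind takes two user-supplied dates. As a minimal sketch, assuming the dates arrive as strings in the form yyyy-MM-dd, the input can be parsed strictly and then bound to positional parameters instead of being concatenated into the SQL text. The class name, the method name, and the SQL template below are illustrative assumptions modeled on the queries in this paper.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;

final class DynamicQueryInput {

    /** SQL template with positional parameters; user input is never concatenated into it. */
    static final String SQL =
            "SELECT Country, COUNT(Report_ID) AS TotalNum "
            + "FROM examination, patient "
            + "WHERE examination.Patient_ID = patient.PID "
            + "AND Endoscopy_Date BETWEEN ?1 AND ?2 "
            + "GROUP BY Country ORDER BY TotalNum DESC";

    /** Parses both dates strictly (yyyy-MM-dd) and checks their order. */
    static LocalDate[] parseRange(String startDate, String endDate) {
        try {
            LocalDate start = LocalDate.parse(startDate);
            LocalDate end = LocalDate.parse(endDate);
            if (end.isBefore(start)) {
                throw new IllegalArgumentException("end date precedes start date");
            }
            return new LocalDate[] {start, end};
        } catch (DateTimeParseException e) {
            throw new IllegalArgumentException("dates must use the form yyyy-MM-dd", e);
        }
    }

    public static void main(String[] args) {
        LocalDate[] range = parseRange("2010-07-01", "2010-10-23");
        // With JPA, the two values would then be bound to a native query via
        // query.setParameter(1, ...) and query.setParameter(2, ...).
        System.out.println(range[0] + " to " + range[1]);
    }
}
```

Anything that does not parse as a date, including a string carrying SQL fragments, is rejected before a query is ever built.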


[Screenshot of the Dynamic Query page: the query "Total number of
endoscopic examination for different country from start date to end date"
with a calendar widget for choosing the two dates (end date 2010-10-23).]

Fig. 5. Interface for executing runtime clinical queries


These queries show that all specific details on patients are withheld when
a query is executed, which also means that all direct access to patient
records is disabled. This is realized by returning an aggregated form of
patient data, in contrast to conventional techniques that hand over
medical datasets from which such details can be inferred. Note that the
toolkit does not allow any query that returns specific information on
patients, such as "Provide details of all patients with a certain age".
Such queries are of no direct relevance to researchers, since they are
mainly interested in collective analysis of a dataset. Although the idea
of combining web services with SQL queries is not new, it provides a
technological solution to a technological problem: avoiding medical data
re-identification risks. The rationale for using Java EE stems from the
fact that it provides an easy way to develop applications; for example,
EJBs are convenient to use, requiring only one annotation. Java EE is also
widespread and widely used both in academia and industry.

127.0.0.1:8080/EHR/rws/dq?startdate=2010-07-01&enddate=2010-10-23

<dataset>
  <item>
    <element>Bulgaria</element>
    <element>3</element>
  </item>
  <item>
    <element>Nicaragua</element>
    <element>2</element>
  </item>
  <item>
    <element>Kiribati</element>
    <element>2</element>
  </item>
  <item>
    <element>Holy See (Vatican City State)</element>
    <element>2</element>
  </item>
  <item>
    <element>Heard Island and Mcdonald Islands</element>
    <element>2</element>
  </item>
  <item>
    <element>Libya</element>
    <element>2</element>
  </item>
  <item>
    <element>Azerbaijan</element>
    <element>2</element>
  </item>

Fig. 6. The retrieved data in XML format corresponding to the query in
Fig. 5


6 Authentication and Authorization Process 

Our toolset implements Role-Based Access Control (RBAC) [6] [16] [17].
RBAC provides a suitable mechanism to restrict users’ access to resources,
for example the ability to perform operations such as insert, delete,
append, and update on a medical dataset. The data model of RBAC is based on
five data types: users, roles, objects, permissions, and the operations
that users can execute on objects.

A sixth data type, session, is used to associate roles temporarily with
users. A role is considered a permanent position in an organization,
whereas the user holding that role can be replaced by another user. Thus,
rights are granted to roles instead of users. Permissions are assigned to
roles and can later be exercised by the users playing these roles. Objects
modeled in RBAC are the potential resources to protect. Operations are
viewed as application-specific user functions. For example, Fig. 7 shows a
list of queries provided to an administrator role.
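
The RBAC idea that rights attach to roles rather than to users can be sketched in a few lines. The following is a minimal in-memory illustration, assuming that a permission is simply the right to run a stored query identified by a number; the class, the user names, and the role names are hypothetical, and the actual toolkit keeps this information in database tables.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class RbacSketch {
    private final Map<String, Set<String>> rolesOfUser = new HashMap<>();
    private final Map<String, Set<Integer>> queriesOfRole = new HashMap<>();

    /** A user may hold several roles. */
    void assignRole(String user, String role) {
        rolesOfUser.computeIfAbsent(user, k -> new HashSet<>()).add(role);
    }

    /** Permissions are granted to roles, never directly to users. */
    void grant(String role, int queryId) {
        queriesOfRole.computeIfAbsent(role, k -> new HashSet<>()).add(queryId);
    }

    /** Union of the queries granted to every role the user holds. */
    Set<Integer> allowedQueries(String user) {
        Set<Integer> allowed = new TreeSet<>();
        for (String role : rolesOfUser.getOrDefault(user, Collections.<String>emptySet())) {
            allowed.addAll(queriesOfRole.getOrDefault(role, Collections.<Integer>emptySet()));
        }
        return allowed;
    }

    boolean canRun(String user, int queryId) {
        return allowedQueries(user).contains(queryId);
    }

    public static void main(String[] args) {
        RbacSketch rbac = new RbacSketch();
        for (int q = 1; q <= 5; q++) {
            rbac.grant("administrator", q);   // illustrative: all five queries
        }
        rbac.grant("organizationA", 2);       // illustrative: restricted set
        rbac.grant("organizationA", 3);
        rbac.assignRole("userX", "organizationA");
        System.out.println(rbac.allowedQueries("userX"));
    }
}
```

Replacing the person who holds a role then only changes one user-to-role assignment; the permissions attached to the role stay untouched.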

To maintain a set of permissions on the GastrOS database, we use the
constructs from RBAC and maintain entries in the corresponding database
tables user, role, querytorole, querylist, and uriforwebservice. We create
a user account in the user table with the assigned role. All the roles are
defined in the role table. User privileges and the list of queries are
defined in the tables querytorole and querylist, respectively. URLs are
stored in the uriforwebservice table. For example, logging in as
administrator provides five SQL queries, shown in Figure 7, whereas logging
in as organization A allows a restricted set of SQL queries, as given in
Figure 4. Security management is supervised by an administrator, who can
add and delete roles as required. Using RBAC allows users to take multiple
roles; for example, the user X could act as a researcher who belongs to
organization A, but can also be assigned another role from the set of
roles. Similarly, a permission can be associated with many roles, depending
on the RBAC policy. The many-to-many relation between roles and queries is
stored in the querytorole table.

Fig. 7. Query list for role of administrator

user                        role                   uriforwebservice
--------------------------  ---------------------  ------------------
id        INT(11)           id    INT(11)          id   INT(11)
username  VARCHAR(20)       name  VARCHAR(100)     uri  VARCHAR(500)
password  VARCHAR(20)
role      INT(11)

querytorole                 querylist
--------------------------  ---------------------------
queryid      INT(11)        id           INT(11)
roleid       INT(11)        description  VARCHAR(1000)
description  VARCHAR(100)   sqlquery     VARCHAR(1000)
                            uriid        INT(11)

Fig. 8. E-R diagram

6.1 Avoiding SQL injections and sensitive information release 

Web application security vulnerabilities arise when an attacker or an 
authorized user manages to submit and execute a database SQL command through a 
web application, thereby exposing the back-end database to an adversary. Such 
SQL injections can be avoided if queries are validated and filtered before 
execution, and are checked against the input data or any encoding supplied by a 
user. To prevent this security issue in our web application, we first validate 
the user input against the set of rules given below: 

BlockList = {name, age, address, zipcode} 

Anti-injectionList = {', ", etc.} 

Note that the special characters given in Anti-injectionList help to avoid SQL 
injections. The set BlockList disables all access to table attributes such as 
name, age, address, and zip code, so that the fetched data stays completely 
anonymized. The members of Anti-injectionList filter out vulnerable input 
characters, e.g., ' and ", so that any similar attempts are rejected. Two 
filters check inputs against BlockList and Anti-injectionList. Before running a 
web service, these two atomic services are always invoked to avoid both the 
identification of actual patients and SQL injections. 

- Service one: Checks the input string for terms in BlockList. 

bool CheckDeIdentification(String s) 

{ 

Check input string s; 

if it contains a term in BlockList, 

return false. Otherwise true. 

} 


- Service two: Checks the input string for characters in Anti-injectionList. 

bool CheckInjection(String s) 

{ 

Check input string s; 

if it contains a character in Anti-injectionList, 
return false. Otherwise true. 

} 
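
The two services above could be sketched in Java roughly as follows. This is a 
minimal sketch under stated assumptions, not the toolkit's actual code: the 
class name InputFilters and the helper isAllowed are hypothetical, 
Anti-injectionList contains only the two quotation marks that are legible in 
the list above, and BlockList terms are matched as case-insensitive substrings. 

```java
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of the two atomic filter services.
final class InputFilters {

    // Attribute names that must never be reachable through user input.
    static final List<String> BLOCK_LIST = List.of("name", "age", "address", "zipcode");

    // Assumption: only the single and double quotation marks are listed here;
    // the original list ends with "etc." and may contain further characters.
    static final List<Character> ANTI_INJECTION_LIST = List.of('\'', '"');

    // Service one: false if the input mentions any BlockList term, true otherwise.
    static boolean checkDeIdentification(String s) {
        String lower = s.toLowerCase(Locale.ROOT);
        for (String term : BLOCK_LIST) {
            if (lower.contains(term)) {
                return false;
            }
        }
        return true;
    }

    // Service two: false if the input contains any Anti-injectionList character.
    static boolean checkInjection(String s) {
        for (char c : s.toCharArray()) {
            if (ANTI_INJECTION_LIST.contains(c)) {
                return false;
            }
        }
        return true;
    }

    // Both services must pass before a web service is run.
    static boolean isAllowed(String s) {
        return checkDeIdentification(s) && checkInjection(s);
    }
}
```

Substring matching is deliberately conservative: it also rejects harmless 
inputs that merely contain a blocked term, such as "stage" (which contains 
"age"). Character filtering of this kind is commonly combined with 
parameterized queries, e.g., java.sql.PreparedStatement, rather than used as 
the only safeguard. 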

7 Conclusions and Future Perspectives 

We presented a technique for automatic identification of clinically-relevant 
patterns in medical data. The main contribution of this paper is in defining and 
presenting an alternative approach to the data de-identification techniques 
commonly employed for anonymizing clinical datasets. Our technique treats 
datasets as a black box and allows data custodians to handle clinical data 
queries directly. Relocating a dataset not only endangers the anonymity of 
patients but also allows adversaries to apply advanced computational methods for 
retrospective processing of data. As clinical data is frequently updated, our 
approach enables data custodians to provide up-to-date resources to their users. 
We integrate RESTful web services and Java EE with a back-end clinical database 
exchanging anonymous XML data, which makes the data exchange language and 
technology independent. Java EE, being equipped with EJBs, is easy to use for 
developing applications. 

In circumstances related to the sharing of patients’ data, complex 
administrative regulations are placed at different levels of management, which 
sometimes unnecessarily complicate the data acquisition process. Providing tool 
support for linking data custodians and data requesters using software 
engineering techniques could pave the way to querying clinical datasets more 
transparently and systematically. We explored new ways of anonymously analyzing 
clinical datasets. Our future work includes expanding the approach to more 
complex databases and supporting an enriched interface for analyzing bigger data 
repositories. We are currently dealing with the challenge of replacing the 
de-identification techniques used for specific attributes in a database table, 
for example, patient id, in cases where a doctor needs to find patients who had 
an increase of systolic blood pressure within a specific period, or patients 
with steady states of diastolic blood pressure for more than a week. Our future 
work considers incorporating such queries into the toolset, including 
implementing ETL processes, such as those in data warehouses, to support 
clinical data analyses on large-scale integrated databases. 